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1 Introduction 



We introduce a new efficient algorithm for the phylogenetic tree reconstruction 
(PTR) problem which rigorously accounts for insertions and deletions. 

Phylogenetic background A phylogenetic tree or phylogeny is a tree represent- 
ing the speciation history of a group of organisms. The leaves of the tree are typi- 
cally existing species. The root corresponds to their most recent common ancestor 
(MRCA). Each branching in the tree indicates a speciation event. It is common 
to assume that DNA evolves according to a Markovian substitution process on 
this phylogeny. Under such a model, a gene is a sequence in {A, G, C, T}'^. Along 
each edge of the tree, each site independently mutates according to a Markov rate 
matrix. The length of a branch is a measure of the amount of substitution along 
that branch. The precise definition of a branch length depends on the model of 
evolution. For roughly constant mutation rates, one can think of the branch length 
as proportional to the amount of time elapsed along a branch. The PTR prob- 
lem consists in estimating a phylogeny from the genes observed at its leaves. We 
denote the leaves of a tree hy [n] = {1, . . . ,n} and their sequences by cti, . . . , cr„. 

The model of sequence evolution above is simplistic: it ignores many mu- 
tational events that DNA undergoes through evolution. At the gene level, the 
most important omissions are insertions and deletions of sites, also called in- 
dels. Stochastic models taking indels into account have long been known [TKF91, 
TKF92], but they are not widely used in practice — or in theory — ^because of their 
complexity. Instead, most practical algorithms take a two-phase approach: 

1. Multiple sequence alignment. Site ti of sequence ai and site tj of se- 
quence CTj are said to be homologous if they descend from the same site to 
of a common ancestor u (not necessarily the MRCA) only through substitu- 
tions. In the multiple sequence alignment (MSA) problem, we seek roughly 
to uncover the homology relation between cii, . . . , (T„. Typically, the output 
is represented by a matrix D of n aligned sequences of equal length with 
values in {A, G, C, T, — }. Each column of the matrix corresponds to homol- 
ogous sites. The state — is called a gap and is used to account for insertions 
and deletions. For instance if sequence ai does not have a site correspond- 
ing to to in u above, then a gap is aligned with positions ti of and tj of Gj 
(which belong to the same column). 

2. Phylogenetic tree reconstruction. The matrix D is then cleaned up by re- 
moving all columns containing gaps. Let D' be this new matrix. A standard 
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PTR algorithm is then appUed to D'. Note that substitutions alone suffice to 
explain D'. 

Traditionally, most of the research on phylogenetic methods has focused on the 
second phase. 

In fact, current theoretical analyses of PTR assume that the MSA problem has 
been solved perfectly. This has been a long-standing assumption in evolutionary 
biology. But this simplification is increasingly being questioned in the phyloge- 
netic literature, where it has been argued that alignment heuristics often create 
systematic biases that affect analysis [LG08, WSH08]. Much recent empirical 
work has been devoted to the proper joint estimation of alignments and phylo- 
genies [TKF91, TKF92, Met03, MLH04, SR06, RE08, LG08, LRN+09]. Here, 
we give the first analysis of an efficient, provably consistent PTR algorithm in the 
presence of indels. Our new algorithm suggests that a rough alignment suffices for 
an accurate tree reconstruction — ^bypassing the computationally difficult multiple 
aUgnment problem. 

Theoretical properties of PTR In addition to computational efficiency, an im- 
portant theoretical criterion in designing a PTR algorithm is the so-called sequ- 
ence-length requirement (SLR). At a minimum, a reconstruction algorithm should 
be consistent, that is, assuming a model of sequence evolution, the output should 
be guaranteed to converge on the true tree as the sequence length k — the num- 
ber of samples — goes to +00 [Fel78]. Beyond consistency, the sequence-length 
requirement — or convergence rate — of a PTR algorithm is the sequence length re- 
quired for guaranteed high-probability reconstruction. The SLR is typically given 
as an asymptotic function of n, the number of leaves of the tree. Of course, it also 
depends on the substitution parameters. 

A classical result due to Erdos et al. [ESSW99a] states that, for general trees 
under the assumption that all branch lengths are bounded by constants, the so- 
called Short Quartet Method (SQM) has poly(n)-SLR. The SQM is a particu- 
lar PTR algorithm based on estimating evolutionary distances between the leaf 
taxa, that is, the sum of the branch lengths between species. Such algorithms are 
known as distance-based methods. The basic theoretical result behind distance- 
based methods is the following: the collection of pairwise evolutionary distances 
between all species forms a special metric on the leaves known as an additive met- 
ric; under mild regularity assumptions, such a metric characterizes the underlying 
phylogeny interpreted as an edge-weighted tree, that is, there is a one-to-one cor- 
respondence between additive metrics and phylogenies; moreover, the mapping 
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between them can be computed efficiently [SS03]. 

A new approach In the classical theoretical setting above where the MSA prob- 
lem is assumed perfectly solved (we refer to this setting below as the ESSW 
framework), the evolutionary distance between two species is measured using the 
Hamming distance (or a state-dependent generalization) between their respective 
sequences. It can be shown that after a proper correction for multiple substitu- 
tions (which depends on the model used) the expectation of the quantity obtained 
does satisfy the additive metric property and can therefore serve as the basis for a 
distance-based PTR algorithm. 

Moving beyond the ESSW framework, it is tempting to account for indels by 
simply using edit distance instead of the Hamming distance. Recall that the edit 
distance or Levenshtein distance between two strings is given by the minimum 
number of operations needed to transform one string into the other, where an op- 
eration is an insertion, deletion, or substitution of a single character. However, 
no analytical expression is known for the expectation of edit distance under stan- 
dard indel models and computing such an expression appears difficult — if at all 
possible. An alternative idea is to compute the maximum likelihood estimator for 
the time elapsed between two species given their sequences. But this involves 
solving a nonconvex optimization problem and the likelihood is only known to be 
efficiently computable under a rather unrealistic assumption known as reversibil- 
ity [TKF91] (see below). 

We use a different approach. We divide the sequences into quantile blocks 
(the first x%, the second x%, etc.). We show that by appropriately choosing x 
above we can make sure that the blocks in different sequences essentially "match" 
each other, that is, they are made of mostly homologous sites. We then compare 
the state frequencies in matching blocks and build an additive metric out of this 
statistic. As we show below, this is in fact a natural generalization of the Ham- 
ming estimator of the ESSW framework. However, unlike the Hamming distance 
which can easily be analyzed through standard concentration inequalities, prov- 
ing rigorously that our approach works involves several new technical difficulties. 
Our analysis relies on a branching process analysis of the site displacements. We 
give a quick proof sketch after the formal statement of our results in Section 1.2. 

Related work For more background on models of molecular evolution and phy- 
logenetics, see, e.g., [GL99, SS03, Fel04]. Following the seminal results of [ESSW99a], 
there has been much work on sequence-length requirement, including [Att99, 
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ESSW99b, HNW99, SS99, CKOl, Csu02, SS02, KZZ03, MR06, DMR06, LC06, 
DHJ+06, MosOV, GMS08, Roc08, DMR09]. 

The multiple sequence alignment problem as a combinatorial optimization 
problem (finding the best alignment under a given pairwise scoring function) is 
known to be NP-hard [WJ94, Eli06]. Most heuristics used in practice, such as 
CLUSTAL [HS88], MAFFT [KMKM02], and MUSCLE [Edg04], use the idea of 
a guide tree, that is, they first construct a very rough phylogenetic tree from the 
data (using for instance edit distance as a measure of evolutionary distance), and 
then recursively construct local alignments produced by "aligning alignments." 

To our knowledge, little theoretical work has been dedicated to the joint esti- 
mation of alignments and phylogenies, with the exception of Thatte [Tha06] who 
gave consistency results for the reversible case in the limit where the deletion-to- 
insertion ratio tends to 1. However, no sequence-length requirement is obtained 
in [Tha06]. In recent related work, the problem of reconstructing ancestral se- 
quences in the presence of indels was considered [ADHRIO, ABHll]. 

1.1 Model of Sequence Evolution 

Phylogeny A phytogeny is represented by a binary tree T = {V,E), whose 
leaves L c V correspond to extant species, and whose bifurcations denote evo- 
lutionary events whereby two new species are generated from an ancestor. The 
root of the phylogeny, denoted by r{T), represents the common ancestor of all 
the species in the phylogeny, and we assume that all edges of T are directed away 
from r(T); so, if e = {u, v) is a branch of the phylogeny, u is the parent of v and 
V is the child of u. Moreover, if v' is in the subtree of T rooted at u, we call v' a 
descendant of u and u an ancestor of v'. 

Along each branch of the phylogeny, the genetic material of the parent species 
is subject to modifications that produce the genetic material of its child species. 
A common biological assumption is that the genetic material of each species u 
can be represented by a binary sequence cr„ = (cr^, . . . , cr^") of length over a 
finite alphabet — for ease of presentation, we work with a binary alphabet {0, 1} 
(but see Section 5 for extensions to richer alphabets) — and that the changes to 
which a„ is subjected along the branch e = (u, v) are described by a Markov 
process. In particular, the Markov property implies that, given the sequence cr^ 
of u, the sequence is independent of the sequences of the species outside the 
subtree of T rooted at u. 

A simplifying assumption commonly used in phylogenetic s is that all species 
have sequences of the same length and, moreover, that every site, i.e., every coor- 



4 



dinate, in their sequences evolves independently from every other site. In particu- 
lar, it is assumed that, along each branch e = {u,v) of the phylogeny, every site 
of the sequence au is flipped with probability pe to the value 1 — al independently 
from the other sites. This model is known as the Cavender-Farris-Neyman (CFN) 
model. A simple generalization to {A, G, C, T} is known as the Jukes-Cantor (JC) 
model. See, e.g., [Fel04]. 

Accouting for indels In this paper, we consider a more general evolutionary 
process that accounts for the possibility of insertions and deletions. Our model 
is similar to the original TKF91 model [TKF91], except that we do not enforce 
reversibility. In our model, every edge e — {u,v)of the phylogeny is characterized 
by a quadruple of parameters {te;r]e, He, K), where te is the evolutionary time 
between the species u and v, and r]e, fJ^e and Ae are respectively the substitution, 
deletion and insertion rates. The Markov process by which the sequence at v is 
obtained from the sequence at u is defined below. See e.g. [KT8 1] for background 
on continuous-time Markov processes. 

Definition 1.1 (Evolutionary Process on a Branch) Given an edge e = {u, v), 
with parameters {tg; rie, fJ^e, K), the sequence ay atv is obtained from the sequence 
(7u at u according to the following Markov process: 

1. Intialize := au, := and t£ := te (where tg is the remaining time 
on the edge e). 

2. While te > 0, 

• (Timing of next event) let Iq, Ii, . . ., Ik^ be exponential random vari- 
ables with rate Ag, -Di, • • •, exponential random variables with 
rate fie, cind Mi, . . ., Mk^ exponential random variables with rate rje; 
suppose that these random variables are mutually independent and let 
T be their minimum; 

• ifT>t£ the process ends at t^; otherwise: 

- (Insertion) if Ij — T, insert a new site whose value is chosen 
uniformly at random from {0, 1} between the sites al and (7^+^ of 

CTv,' 

- (Deletion) if Dj = T, delete the site alfrom a^; 

- (Substitution) and if Mj = T, replace a^ by 1 — ai; 

(If j — 0, then ai is undefined and, if j — Ky, then ai'^^ is undefined.) 
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• (Remaining time) update according to these changes, and update 
to reflect the new sequence length; set the remaining time :— 
ti - T; 

In words, the evolutionary process defined above assumes that every site of the 
sequence of the parent species is, independently from the other sites, subjected 
to a sequence of evolutionary events that flip its value; these events are distributed 
according to a Poisson point process of intensity r/e in the time interval [0,te]- 
However, the site may get deleted and therefore not be inherited by the sequence 
of the node v\ this is determined by whether an exponential random variable of 
rate ji^ is smaller than t^. While each site of the parental sequence au is subjected 
to this process, new sites are introduced in the space between existing sites at rate 
Ae, and each of these sites follows a similar process for the remaining time. In 
essence insertion and deletion events are governed by an independent branching 
process for each ancestral site. Note further that the order of the sites, as described 
above, also plays a role. 

Remark 1.2 Unlike [TKF91], we do not use an "immortal link" and we do not 

assume that the length process is at stationarity. Our techniques can also be 
applied to the TKF91 model without much modifications. We leave the details to 
the reader 

Given the evolutionary process on a branch of the phylogeny, the evolutionary 
process on the whole phylogeny is defined as follows. 

Definition 1.3 (Evolutionary Process) Suppose that every site of the sequence 
ar{T) at the root of the phylogeny is chosen to be or 1 uniformly at random. 
Recursively, if cTu is the sequence at node u and e = (u, v) is an edge of the 
phylogeny, the sequence at node v is obtained from the sequence cr„ by an 
application of the evolutionary process on a branch described by Definition 1.1. 

For ease of exposition, we first present our proof in the special case where 
the substitution, insertion and deletion rates are the same on all edges of the phy- 
logeny. 

Definition 1.4 (Ultrametric Assumption) Under the ultrametric assumption, the 
leaves of the phylogeny are contemporaneous, that is, there exists H such that for 
each u & L the sum of evolutionary times on the branches between u and the 
root is H. 
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Definition 1.5 (Molecular Clock Assumption) Under the molecular clock as- 
sumption, we assume that the ultrametric assumption holds. Moreover, there exist 
T), /J, and A such that r/e = rj, /le = l-i' cmd Ae = A, for all e G E. 

We discuss a more general case in Section 5. 

Notation In the sequel, we label the leaves of the phylogeny with the positive 
integers 1, 2, . . ., n, so that L = {1, . . . , n}, and the root r(T) of the phylogeny 
with 0. 

1.2 Main Result 

Statement of results We begin with a consistency result. Here we consider a 
completely general phylogeny, that is, neither the ultrametric, nor the molecular 
clock assumptions need hold. 

Theorem 1 (Consistency: Finite Case) Assume that < t(,,r]e, X^,, Pe < +oo, 
for all e & E. Then there exists a procedure returning the correct tree from the 
sequences at the leaves, with probability of failure approaching as the sequence 
length at the root of the tree goes to +oo. 

Our main result is the following. For simplicity we first work under the symmetric 
two-state case and assume that the Molecular Clock Assumption holds. 

Theorem 2 (Main Result: Two-State, Molecular Clock Case) Consider the two- 
state model under the Molecular Clock Assumption. Assume further that there 
exist constants 

0<f,9< +00, 

independent ofn, such that 

f <te<g, WeeE. 

Moreover, assume that 

Tje = r], Ae = A, lie = A*, Vc G E, 

where rj, A, and /i are bounded between constants (independent ofn)0 < rj < 
fj < +00, = A < A < +00, and = < p, < +oo respectively. Under the 
assumptions above, for all (3' > there exists (3" > such that there exists a 
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polynomial-time algorithm solving the phylogenetic reconstruction problem ( that 
is, returning the correct tree) with probability of failure n"^ , if the root sequence 
has length kr > . ^ 

Remark 1.6 (Branch Lengths) Our assumption that all branch lengths t^, e G 
E, satisfy f < te < g is standard in the sequence-length requirement literature 
following the seminal work of [ESSW99a]. 

Extensions In Section 5 we derive the following extension. Let Q be a re- 
versible 4x4 rate matrix with stationary distribution tt. (Larger alphabets are 
also possible.) The GTR sequence evolution process is identical to the one de- 
scribed in Definition 1.1 except that the substitution process is a continuous-time 
Markov process with rate matrix rjeQ. 

Theorem 3 (Main Result: GTR, Bounded-Rates Case) Consider the GTR model 
with rate matrix Q under the Ultrametric Assumption (but not necessarily the 
Molecular Clock Assumption). Assume further that there exist constants 

< f,9, V, V, A, A, [I, fi, < +00, 

independent ofn, such that 

f <te<g, ri<rie<fj, ^ee E. 

Moreover, assume that 

Xe = A, /Xe = A*, Ve e E, 

where A and fi are bounded between constants (independent ofn)(} = \<\< 
-\-oo and = /i < jl < +oo respectively. We refer to the conditions above as 
the Bounded-Rates Assumption. Under the assumptions above, for all /3' > 
there exists j3" > such that there exists a polynomial-time algorithm solving 
the phylogenetic reconstruction problem (that is, returning the correct tree) with 
probability of failure , if the root sequence has length kr>n^. 



'in [DRIO], a preliminary version of this result was announced without proof with the much 
stronger assumption that \,p, = 0(1/ log n), that is, that the indel rates are negligible. Here we 
show that this assumption can be relaxed (at the cost of longer sequences). 
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Proof sketch Consider the two-state, molecular clock case. As we noted before, 

unlike the classical setting where the Hamming distance can be analyzed through 
standard concentration inequalities, proving rigorously that our approach works 
involves several new technical difficulties. The proof goes through the following 
steps: 

1. Expectations. We first compute expectations of block statistics, which in- 
volve analyzing a continuous-time Markov process. We use these calcula- 
tions to define an appropriate additive metric based on correlations between 
blocks. 

2. Sequence length and site displacements. We give bounds on how much 
sequence lengths vary across the tree, through a moment-generating func- 
tion argument. Using our bounds on the sequence length process, we bound 
the worst-case displacements of the sites. Namely we show that, under our 
assumptions, all sites move by at most 0{^/k log A;). 

3. Sequence Partitioning. We divide each sequence in blocks of size roughly 
/c^ for C, > 1/2, where k is the sequence length at the root. From our bounds 
on site displacements, it follows that the blocks roughly match across dif- 
ferent sequences. In particular, we bound the number of homologous sites 
between matching blocks with high probability and show that the expected 
correlation between these blocks is approximately correct. 

4. Concentration. Finally, we show that our estimates are concentrated. The 
concentration argument proceeds by conditioning on the indel process sat- 
isfying the high-probability conditions in the previous points. 

The crux of our result is the proper estimation of an additive metric. With such an 
estimation procedure in hand, we can use a standard distance-based approach to 
recover the phylogeny. 

Organization The rest of the paper is organized as follows. The evolutionary 
distance forming the basis of our approach is presented in Section 2. We describe 
our full distance estimator in Section 3 and prove its concentration in the same 
section. Extensions are described in Section 5. 
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2 Evolutionary Distances 



Consider the two-state, molecular clock case. In this section, we show how to 
define an appropriate notion of "evolutionary distance" between two species. Al- 
though such distances have been widely used in prior phylogenetic work and have 
been defined for a variety of models [SS03, Fel04], to our knowledge our defi- 
nition is the first that applies to models with indels. We begin by reviewing the 
standard definition in the indel-free case and then adapt it to the presence of indels. 
Our estimation procedure is discussed in Section 3. 



Suppose first that A = = 0, that is, there is no indel. In that case, the sequence 
length remains fixed at k and the alignment problem is trivial. Underlying all 
distance-based approaches is the following basic definition. 

Definition 2.1 (Additive Metric) A phytogeny is naturally equipped with a so- 
called additive metric on the leaves P : L x L — > (0, +oo) defined as follows 



where VTia. b) is the set of edges on the path between a and h in T and where 
Ue is a nonnegative function of the parameters on e — in our case, t^, Tjf., Ag, and 
He- For instance, a common choice for uj^ would be uj^, = rjetg in which case 
V{a,b) is the expected number of substitutions per site between a and b. Often 
T>{a,b) is referred to as the "evolutionary distance" between species a and b. 
Additive metrics are characterized by the following four-point condition: for all 
a, b,c,d& L, 



Moreover, assuming cUg > Ofor all e E E, it is well-known that there exists a one- 
to-one correspondence between T> and T as a weigthed tree with edge weights 
{uJejeeE- We will discuss algorithms for constructing T from T> in Section 4. For 
more background on tree-based metrics, see [SS03]. 

Definition 2.1 impHes that phylogenies can be reconstructed by computing 
I>(a, b) for all pairs of leaves a,b E L. Assume we seek to estimate the evolu- 
tionary distance between species a and b using their respective sequences. In a 



2.1 The Classical Indel-free Case 




eePT(a,6) 



V{a, b) + V{c, d) < max{X'(a, c) + V{b, d),V{a, d) + V{b, c)}. 
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first attempt, one might try the (normalized) Hamming distance between Ua — 
{a\, . . . , a^) and 0-5 = (a^, . . . , cr^). However, the expected Hamming distance — 
in other words, the probability of disagreement between a site of a and h — does 
not form an additive metric as defined in Definition 2.1. Instead, it is well-known 
that an approriate estimator is obtained by "correcting" the Hamming distance 
for "multiple" substitions. Denoting by 1-l{aa,(Jf,) the Hamming distance be- 
tween (Ta and (Tfo, a Markov chain calculation shows that V{a, b) = log(l — 

2E[H{aa, o-b)]), with the choice coe = rjete- See e.g. [Fel04]. In a distance-based 
reconstruction procedure, one first estimates V with 

P(a,6) = -^log(l-2%„,(76)), (1) 

and then applies one of the algorithms discussed in Section 4 below. The sequence- 
length requirement of such a method can be derived by using concentration results 
for H [ESSW99a, Att99]. 



2.2 Taking Indels into Account 

To simplify the presentation, we assume throughout that \^ \x. The case X — 
follows from the same argument. 

In the presence of indels, the estimator (1) based on the Hamming distance 
is difficult to apply. One has to first align the sequences, which cannot be done 
perfectly and causes biases as well as correlations that are hard to analyze. Alter- 
natively, one could try a different string distance such as edit distance. However, 
computing the expectation of edit distance under indel models appears difficult. 

We use a different approach involving correlations between state frequencies. 
We will eventually apply the estimator to large sub-blocks of the sequences (see 
Section 3), but we first describe it for the full sequence for clarity. For a node u, 
let Ku be the (random) length of the sequence at u and Z„, the number of O's in 
the sequence at u. Then, our distance estimator is 

V{a, h) = {^Za - ]^K}j {^Zb - ]^K^ . 

We now analyze the expectation of this quantity. For e y, we let 




be the deviation of from its expected value (conditioned on the sequence 
length). 
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Single channel Suppose T is made of a single edge from the root r to a leaf a 

with parameters t, i], X, fi. Assume first that the original sequence length is kr = 1. 
Let Ka be the length of the sequence at a. Then Ka is a continuous-time branching 
process and, by Markov chain calculations [AN72, Section III.5], its moment- 
generating function is 

F{s,t) = E Is^"] = Vt 7^ , , (2) 

By differentiating F(s, t) we derive 

E[K„] = e-^'^-^)*, (3) 

and 

Var[ir„] = ^i±^[e-('^-^)* - e'^^^-^^']. (4) 
/i — A 

Let K* be the number of "new" sites at a, that is, excluding the original site if it 
survived. (We ignore the substitutions for the time being.) The probability that 
the original site survives is e~'**. Then, 

E [K*] =E[Ka- l{original site survives}] = e'^'^-^)* - e"''*, 

by linearity of expectation. 

We now take into account substitutions. Assume that the original sequence 
length at r is a random variable Kr and that the sequence at r is i.i.d. uniform. 
Denote by Z^. the number of O's at r. The probability that a site in r that is still 
surviving in a has flipped its value is 

p — P[ state flips odd number of times in timet] 



U (2J + 1)! 

e"''* sinh rjt 
1 - 



Also, note that a new site created along the path between r and a has equal chance 
of being or 1 a? the end of the path. Then, we have: 

Lemma 2.2 (Single Channel: Expected Deviation) The following holds: 
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Proof: We have 



E[A„|X„Z,] 



ZrC-^^l -p) + {Kr - Zr)e-^^p 

2 2 




(5) 



where on the first two lines: 

1. the first term is the number of original O's surviving in state 0; 

2. the second term is the number of original I's surviving in state 0; 

3. the third term is the number of new sites surviving in state 0; 

4. the fourth term is half the sequence length at a given the length at r. 



Fork channel Consider now a "fork" tree, that is, a root r from which emanates 

a single edge = (r, u) which in turn branches into two edges = {u, a) and 
Cf, = {u, b). For X = a,b, u, we denote the parameters of edge by t^, \x, (J'x, Vx- 
Our goal is to compute E[T){a, b)] assuming that the sequence length at the root is 
kr. We use (5), the Markov property and the fact that Z„ conditioned on Ku is a 
binomial with parameters 1/2 and Ku. We get: 

Lemma 2.3 (Fork Channel: Expected Distance) The following holds: 



E V{a,b) 



4 
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Proof: We have 



E 



V{a,b) 



E [AaAb] 

E[E[AaA,\K,,Z^]] 

E [E [A, I K,, E [Ab I Ku, Z^]] 



_ Q-^riata^-tiata^-^mtb^-t^bh 



where we used (3) and Lemma 2.2. ■ 

Molecular clock We speciahze the previous result to the Molecular Clock As- 
sumption. That is, we assume, for x ~ a, h, u, that — \, ji^ — A*, and 7]^ — rj. 
Note that by construction ta = h (assuming species a and b are contemporary). 
We denote t — ta and t = tu + ta. Denoting k = !si±J^^^ then get: 

Lemma 2.4 (Molecular Clock: Expected Distance) The following holds: 



E 



V{a,b) 



^-(4ri+fj,+\)t 



Letting 
we get that 



= 477 + // + A, 



-2logE[K-^V{a,b)] = 2^t, 



which is the evolutionary distance between a and b with the choice Ue — /3te 
Therefore, we define the following estimator 



V*{a,b) = -2\og K-^V{a,b). 



3 Distance Computation 

We now show how to estimate the evolutionary distance between two species by 
decomposing the sequences into large blocks which serve as roughly independent 
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samples. We use the following notation: Mt — e Dt — e 6 — fj, — X, 

(p = 12 + \,andTt = 5'^X(1 - Mt). 

We show in Section 4 that the time elapsed between the root and the leaves is 
bounded by ^ logg n. Hence under our assumptions 

-2 2 -2 - 2 

f-l = g-(M+A)^log2n ^ g-M^log2n < < g-^^log2" < g(^+;i)^ loga n ^ 

(6) 

T;^i < e-''Ti°S2" < L>j < 1, (7) 

and 



< = Ai— < A^logon^^ = e-'T^^ean _ i< T„, 

(/^-A)t - / A^log^n 

(8) 

where we used that the function — e~^) is non-negative and decreasing since 
its derivative is 



xe ^ — (1 — e ^) _^il + x) — e 



< 0, Xy^O. 



Note that the bounds above are polynomials in n with exponents depending only 
on /, g, A and p,. In particular, we will ultimately take sequence lengths kr of 
the form with ,5" chosen much larger than the exponent in T„. We call poly- 
nomials in n (such as T„) which have an exponent not depending on (3", small 
polynomials. As a result, the following notation will be useful. For a function 
W{kr) of kr, we use Sn{W{kr)) to denote a function smaller or equal to W{kr) 
up to a small polynomial factor. (The latter will be used similarly to the big-O 
notations.) 



3.1 Concentration of the Indel Process 

Sequence length We first show that the sequence length is concentrated. Let T 
be single channel consisting of edge e = (r, a). Let kr be the length at r. 

Lemma 3.1 (Single Channel: Large Deviations of Sequence Length) For all ^ > 

and kr > kr — with j3"' > large enough, with probability at least 1 — k~'^. 



Ka = KMt ±Sr^\ V kr log k 



where the small polynomial factor in Sn [ y kr log kr depends on 7 as well 
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Remark 3.2 Although we stated Lemma 3.1 for the full sequence, it will also be 

needed for "half-sequences" and "blocks." In particular, we use the previous 
lemma to track the position of sites. In that context, one should think ofk^ as the 
position of a site in r and K^, as its position in a. Then we can use krfor the full 
sequence length at r. See Section 3.2. 

Proof: We think of Ka as 

i=l 

where Ka,i is the number of sites generated by a single site of the sequence at r. 
Intuitively, Ka,i is the number of sites that were inserted between the sites i and 
i + 1 of the sequence at r, plus the site at position i itself if it survived. Clearly 
the variables {iCa.iji are mutually independent. 
Using (3) we obtain that 

For £ > 0, by Markov's inequality we have 

F[Ka > KMt + Ke] < s-'='-(^*+^)E[s^»] = (s-(^*+^)E[s^«'1])'''' . (9) 

We take s = 1 -\- Ce for C > to be determined. 
We have 

^ ^ \{s - 1) - e('^-^)*(As - p) 
(p- XMf^)Ce + 5Mi^ 
A(l - Mt-^)Ce + 5Mf^ 
S-\ijMt- X)Ce + 1 
S-^X{Mt - l)Ce + 1 
1 - {X-^fiVt - l)Ce 
1 - VtCe 

+ 0O 

= [1 - (A- - l)Ce\ Yy^tCeW 

i,=0 

whenever VtCe < 1. Hence, if T„C£ < 1 is bounded away from 1 (independently 
of n), we have using (8) 

E[s^-i] = [l-{X-^^iTt-l)Ce][l + TtCe + {VtCef + 0{{i:r,Cef)] 
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Moreover, using the binomial series and (6) and assuming Ce < 1 



-(Mt+e) 



+00 

E 

t=0 



-Mt - e){-Mt -e -!)■■■ {-Mt -e-i + 1) 



[Ce] 



< l-{Mt + e){Ce) + 



+00 



t=3 



= l-{Mt + e){Ce) + 



{Mt + s){Mt + s + l) 



whenever s is small and TnCs < 1 is bounded away from 1 (independently from 
n). Therefore, 

^_(M.+e)E[^K„,,] ^ 1 _ ,^ce) + M,r,(C£)^ + + ^)iMt + e + 1) ^^^y 

-{Mt + e)Mt{Cef + 0{{T,,Cef). 

Note that the second term on the RHS depends on C whereas the remaining terms 
depend on C^. Taking C = T:^^'^Co{^) with Co(7) > small enough and c — 
Co (7) > large enough, using (9) with the choice 



e — c\ 



' Tllog kr 



we get that 



P 



Ka > KMt + cJTlkrlogk 



< F[Ka > KMt + kr€] 
^ I 0{\0gkr)\ 



< kz\ 



Note that our choice of e satisfies TnCe < 1 for kr a large enough polynomial of 
n (compared to the small polynomial T„). 

A similar inequality holds for the other direction. ■ 
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Figure 3.1: The Fork Channel 



Correlated sites Now let T be the fork channel consisting of nodes r, u, a 
and b as in Figure 3.1. Assume that a and b are contemporary, call t the time 
separating them from u, and denote by Sab the number of sites in a and b that are 
jointly surviving from u. These are the sites that produce correlation between the 
sequences at a and b. All other sites are essentially noise. We bound the large 
deviations of Sab- 

Lemma 3.3 (Fork Channel: Large Devations of Jointly Survivmg Sites) Condition 
on the sequence length at u being k^. Then, for all'-f > and all ku > ku = n 
with 13"' > large enough, with probability at least 1 — A;^'^, 



13" 



Sab — kuDf i Sn 



ku log ku 



where the small polynomial factor in Sn ysj log k^j depends on 7 as well. 

Proof: Each site in u survives in a with probability Dt. The same holds for b 
independently. The result then follows from Chernoff's bound (see Appendix). 
We have 



P 



Sab < kuD^ - cV Tlku log ku 



< P 

< exp 



c / 7 n2 . h^l'^ogku 2 
Sab < kuD^ - c\l kuDt 

rCu, 



-c'TlD^logku) 



for c = 0(7) > large enough, where we used (7). 
The other direction is similar. ■ 
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3.2 Sequence Partitioning 



From Lemma 3.1, it follows that the sites of the root sequence (or of an internal 
sequence) remain fairly close to their expected position at the leaves. We take ad- 
vantage of this fact by dividing each sequence into blocks of size asymptotically 
larger than the typical displacement implied by Lemma 3.1. As a result, match- 
ing blocks in different sequences share a significant fraction of sites. Moreover, 
distinct blocks are roughly independent. We estimate the evolutionary distance 
between two leaves by comparing the site frequencies in matching blocks. This 
requires some care as we show next. 

Consider the fork channel. We seek to estimate the evolutionary distance 
V{a,h) between a and h (normalized by the sequence length at u). 

Partitioning the leaf sequences Let be some deterministic length (to be de- 
termined), and consider the first sites in the sequences a a and ajj at the nodes a 
and h respectively. If the sequence at a or 6 has length smaller than k^, we declare 
that our distance estimate V{a, h) (see below) is +oo. 

We divide the leaf sequences into L blocks of length £ where ^ = [/cq] , for 
some I < C < 1 to be determined later, and L = [ko/£\. We let k^ = £L. For 
alH = 1, . . . , L, we define the i-th block aa,i of a to be the subsequence of aa 
ranging from position (i — 1)£ + 1 to position ii. We let Za,i be the number of 
zeros inside aa,i and define the block deviations 



for alH = 1 , . . . , L. And similarly for the sequence at b. 

Using the above notation we define our distance estimator next. Assume that 
L is even. Otherwise, we can just drop the last block in the above partition. Our 
estimator is the following: 



Notice that in our summation above we skipped every other block in our sequence 
partition to avoid overlapping sites and hence decrease potential correlations be- 
tween the terms in the estimator. In the rest of this section, we analyze the proper- 
ties of 'D{a, b). To do this it is helpful to consider the sequence at u and the events 
that happened in the channels defined by the edges {% a) and {% b). 



A. 



i 

2' 




j=0 
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Partitioning the ancestral sequence Let us choose to be the largest integer 
satisfying 

iuMt < L (10) 

Suppose that the sequence at node u is not shorter than k'^ = (L — and 
define the i-th ancestral block a^^i of u to be the subsequence of ranging from 
position (i — + 1 to position for alH < L — 1. Given Lemma 3.1, the 
choice of in (10) is such that the blocks of u and the corresponding blocks at a 
and h roughly aUgn. 

In order to use the expected evolutionary distance as computed in Lemma 2.4, 
we define an "interior" ancestral block which is guaranteed with high probabil- 
ity to remain entirely "inside" the corresponding leaf block. Let 5u = \L + 
■^^Sn{\/k'^\og k'^)\ , where the small polynomial factor is the maximum of those 
in the proofs of Lemma 3.1 and Lemma 3.3 for a given choice of 7. (The L — 
o{-\/ko) in 5u is needed only when (10) is a strict inequality. See the proof of 
Lemma 3.4 below.) We define the i-th (ancestral) interior block a'^^oiu to be the 
subsequence of » ranging from position [i — 1)£„ + 5u of (T„ to position i£u — Su- 
Notice that (5„ = Sn{y/ko^ogk^), while = ^^(/co)- Therefore, for ko > k^, 
where is sufficiently large, {i — l)iu + Su < iiu — so that the sequence a'^ .^ 
is well-defined. 

Also, for all 2 = 1, .... L — 1, we define ■, ■ to be the position of the 
leftmost, respectively rightmost, site in the sequence a„ descending from the site 
at position (i — !)£„ + 5u, respectively iiu — 6u of cr«- Similarly we define x'^, ^ and 
j. Given this notation, we define the following "good" event 

E[^{yi<L-l : {i-l)e< x^, < (2 - l)i + 2Mt8u, 

Intuitively, when the event E[ holds, all surviving descendants of the interior block 
(7^ j are located inside the blocks aa,i and Ub^i respectively (and the blocks remain 
large enough). 

To argue about block independence, we also define the exterior block a" ^ of 
u to be the subsequence of (T„ j ranging from position (i — !)£„ — 5„ of (T„ to 
position ily_ -\- 5^ with corresponding positions x'^ ,^, y'^p x'l^^ and y^'j and good 
event E'( defined similarly as above. 

We define 
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We show that this event holds with high probability, conditioned on the sequence 
length Ku at u being at least k'^. Figure 3.2 shows the structure of the indel process 
in the case that the event £i holds. 

interior block boundary exterior block boundary 




sequence at u 
blocks have length i, 



sequence at a 

blocks have length £ £uMt 



Under event £i : 

- the descendants of the interior blocli of u fall inside the block of a. 

- the descendants of the exterior block of u encompass the whole block of a, 

- the windows of boundary uncertainty have length 2duMt . 

Figure 3.2: Under the event £i the descendants of the interior blocks of fall 
inside the corresponding blocks of aa', the descendants of the exterior blocks of 
(T„ contain all surviving sites inside the corresponding blocks of cTq; the windows 
of uncertainty have length 2Mt6u- 



Lemma 3.4 (Interior/Exterior Block Is Inside/Outside Leaf Block) Conditioned 
on the event {Ku > k'^}, we have 

ns,]>i-i6L (ly. 

Proof: It follows from Lemma 3.1 that the leftmost descendant of the site at 
position (i — + Su of cr„ is located inside the sequence of node a at position 
at least 

Mt{{t - 1)4 + Su) - SniVK^ogk'J > Mt{{t-l)eu + L) 

with probability > 1 — j . The other bounds follow similarly. Taking a union 
bound over all z's establishes the result. ■ 
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Block correlation Let Sab,i be the number of common sites in the blocks aa,i 
and (Tb,i that are jointly surviving from u. Similarly we define S'^j, ^ and S'"^ ^ where, 
for ^ = a,b, a'^- (respectively c^',.) denotes the subsequence of ranging from 
position j (respectively x^' j) to position y'^^^ (respectively y'^.j). We define a good 
event for Sab,i as 

£2 = {Vi < L - 1 : 4 A' - 3^t<^« < Sab,i < LDl + 3Mt5J. 

Lemma 3.5 (Jointly Surviving Sites in Blocks) Conditioned on the event {Ku> 
k'^, we have 



m] > 1 - 18L (1) 



Proof: We bound 



¥[E^] = ¥[E^ n £1] + ¥[£^ n £1"] < ¥[£^^ n £1] + P[£^] < P[£2' n £1] + I6L 

By construction, under 81 we have 5^^ ^ < Sab,i < -S'^';, ^ so that 

< P [3^, 5:,, < (4 - 25„ + l)Dl - Sni^/KMK) 
+P [3^, Sl,^, > (4 + 25„ + 1)^2 + 5„( V^p^ 

by Lemma 3.3, where we also used the fact that < Mf.M 



3.3 Estimation Guarantees 

We are now ready to analyze the behavior of our estimate V{a,h). In this sub- 
section we compute the expectation and variance of V{a, h). We denote by X a 
realization of the indel process (but not of the substitution process) on the paths 
between u and a.h. We denote by 8 the event that {K^ > k'^, 81, and 82 are 
satisfied. Suppose that k^ > k^ (defined in Section 3.2). 
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Lemma 3.6 (Block Independence) Conditioning on I and £, the variables 



are mutually independent. 

Proof: Observe that when > k'^ the ancestral blocks Ou^i are well defined. 
Assuming that > k^, the interior blocks a'^ ^ are also well defined and disjoint. 
Hence, for a fixed X under S, for alH < L — 1, both » and A5 j depend 
on the subsequence of ranging from position (i — — 5„ + 1 to position 
i-^u + 5« — 1. In this case, for j e (1, . . . , L/2 — 1}, different Aa_2j+iA(,^2i+iS are 
functions of different subsequences of ay,. Observe that, since the root sequence is 
i.i.d. uniform and the insertions above u are also i.i.d. uniform, the state of every 
site in (t„ is uniform and independent from the other sites. It follows from the 
above observations that {Aa,2j+i^b,2j+i}j=i~^ are mutually independent. ■ 

Lemma 3.7 (Expected Correlation under Good Event) We have 

Proof: Let A^ ^ be the contribution to A^ j from those common sites between a 
and b that are jointly surviving from u. Let A^f = A^^j — A^ j. And similarly for 
b. Then 

E[A„,A,, \I,S] = E[{Al, + A^J){Al + A^^) 1 1, S] 
= E[AlAl\I,S], 

since the contribution from A^f and A^f is independent and averages to 0. Write 
A^ ■ as a sum over the jointly surviving sites, that is, 

AS . ^ f ^0) _ M 

where z)l';^ is 1 if the corresponding site of a is 0. Note that the terms in parentheses 
have zero expectation under X and E. Then, 

e\a%a%\xa^'y.^ 
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by independence of the sites. We compute the expectation above. We have 



Therefore 



E 



■'bA 



1 



(i) 



E 



I.S 



1 

4 



1 1 + e-^'i^ 1 

2 2 4 



The result then follows from the definition of £2- • 
Lemma 3.8 (Variance under Good Event) We have 



Ya.r[Aa,Ab,i\I,S]<-f. 



Proof: By Cauchy-Schwarz, we have 

E[AlAl\X,£] < (E[Al,|X,£:]E[At,|X,£:]) 



1/2 



1/2 



- 16 ' 

where we used the fact that the length of the sequences aa,i and 0-5 ^ is determinis- 
tically i, and the number of zeros in aa,i and cr^^j follows a Binomial distribution 
with i trials and probability 1/2. ■ 

Lemma 3.9 (Distance Estimate) We have 



E 



V{a,b) I X,£] = ^e-(^''+'^+^)^£±5„ (V^^b^), 
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and 



Var 



V{a,b) I X,E 



< 



8 L^o-^J 



In particular, the standard deviation 



STD 



V{a,h) \ X,8 



3C-1 



= o(V^o), 



for C, > 1/2 small enough. 



Proof: From Lemma 3.6, the L/2 = [kQ/i\/2 terms in V{a,b) are mutually 
independent. The proof then follows from Lemmas 3.7 and 3.8, and the definition 
ofL. ■ 



3.4 Concentration 



We now show that our distance estimate is concentrated. For notational conve- 
nience, we denote by P* the probability measure induced by conditioning on the 
event {Ku > k'^^}. Recall that the event £ is contained in {f^^ > k'^}. 

Lemma 3.10 (Concentration of Distance Estimate) Let a > be such that ( — 
a > 1/2, and (3— 1 — ( — 2a> 0, /or ( > 1/2 small enough. Then for kg large 
enough 



-V{a,b) 



_ „-(4?7+M+A)t 



> 



1 



< O 



Proof: We use Chebyshev's inequality. We first condition on X, S. Recalling that 
i = [A;^], note that 



P! 



< P! 



<K 



£ 



1 



X,£ 



^ 1 

V{a, b)>E [p(a, b) \ I,sj- Sn (V^ologA^o) +-—\I,S 



< 



IK 



l-C, 



(f - Sn {Vh log ko 

1 



O 



l,l-C-2a 
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The other direction is similar. Taking expectation over X, we have 



1 



> ^ I ^ 

Let 



< o 



Choose 7 > in Lemmas 3.1 and 3.3 large enough so that 

7-(l-C)>^- 
Then, from Lemmas 3.4 and 3.5, we have 



K 



< P! 



< O 



4 



> 



Uc 



> 



ua 



p:[£]+p:[£^ 



1 

ftp 



The proofs of Theorems 1 and 2 are given in the next section. 



4 Putting it All Together 

Large-scale asymptodcs We are ready to prove our main result in the molecular 
clock case. We postpone the more general case to the next section. A last bit of 
notation: For a pair of leaves a, 6 G [n], we denote by tab the time between a, b 
and their most recent common ancestor. 

Proof of Theorem 2: We first give a bound on the diameter of the tree. Let h 

(resp. H) be the length of the shortest (resp. longest) path between the root and a 
leaf in graph distance. Because the number of leaves is n we must have 2^ < n 
and 2^ > n. Since all leaves are contemporaneous it must be that Hf < hg. 
Combining these constraints gives that the diameter Diam satisfies 

2- logo n<2h< Diam < 2H < 2-^ logo n. 
9 ~ ~ f 
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Given our bound on the diameter of the tree, it follows that the time from the root 

2 

r of the tree to any leaf is at most y log2 n. Suppose that the length kr at the root 
of the tree satisfies kr > k* — k*{ko), where k* is the minimum integer satisfying 

where the small polynomial factor is taken to be the one used in Lemma 3.1. 
Lemma 3.1 and a union bound imply then that with probability at least 



for all nodes u 



1 - o{n) ■ {k;y 



K, > k' 



Lemma 4.1 (Concentration of Distance Estimate) For all a' > Q, ^' > 0, there 



exists kn 



n 



/3" 



with (3"' > large enough so that if the sequence length at the 



root is kr > k*{ko), then 

4 



P 



Va, 6 G [n], 







^ 1 


— ( 







Proof: This follows from Lemma 3.10 and our observation above that, if kr > 
k*{ko), with probability at least 1 — 0{n) ■ {k*)~'^, then Ku > k'^ for all nodes u. 



Given our bound on the diameter of the tree, it follows that for all pairs of 
leaves a, b and small £ > 

g-(4^+M+A)t„6±£ ^ ^_(4r,+^+A)U(l ± 0{e)) > ^(1 ± 0{e)). 

Therefore, choosing a' large enough in Lemma 4.1, we get that all distances can 
be estimated within a small e simultaneously with probability going to 1. 

Using the standard Buneman algorithm, we can recover the tree efficiently. 
See e.g. [SS03]. ■ 



Constant-size case The proof of Theorem 1 for the molecular clock case builds 
on the proof of Theorem 2 by treating n as a constant and letting the sequence 
length at the root of the tree go to infinity. 
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Proof of Theorem 1: (Molecular clock case) We can restate Lemma 4.1 in the 

following form, where the failure probability is expressed more cleanly in terms 
of the sequence length at the root of the tree. The proof of the lemma is essentially 
the same. 

Lemma 4.2 (Concentration of Distance Estimate) For all a' > 0, there exists 
/cq = n'^ for [5"' > large enough such that if the sequence length at the root is 



Va, 6 e [n], 



-{Ar)+n+\)tab 



< 



l-0{n-k;^)-0{n^-k. 



Repeating the proof of Theorem 2 above, it follows that the algorithm fails to 
reconstruct the phylogeny with probability 0{n ■ k~'^) + 0{ri^ ■ k~^). Letting 
A;^ ^ +00 concludes the proof of Theorem 1. ■ 



5 Extensions 

GTR model We briefly discuss how our results can be extended to GTR mod- 
els. For background on GTR models, see e.g. [Fel04]. Let Q be a reversible 4x4 
rate matrix with stationary distribution vr. Our new sequence evolution process is 
identical to the one described in Definition 1.1 except that the substitution pro- 
cess is a continuous-time Markov process with rate matrix rj^Q. The rate matrix 
Q has 4 nonnegative eignevalues. For convenience, we assume that the largest 
negative eigenvalue is —1. We denote by w the corresponding eigenvector which 
we assume is normalized as follows 

T^sWl = 1. 

se{A,G,C,T} 

We now perform the following transformation of the state space. For a node 
M, let cr„ = (cr^, • • • be the transormed sequence at u where = 

(resp. Wg,Wc,Wj) if the state at site i is A (resp. G, C,T). Note that, under sta- 
tionarity, the expectation of the state at site i is by orthogonaUty of tt and w. 
Then our distance estimator is 
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In particular, in the two-state CFN case, we have w — (+1, —1) and we obtain the 
same estimate as before up to a constant. We now analyze the expectation of this 
quantity. For m e we let 

i=l 

Lemma 5.1 The following holds: 

E[A„ I ar] = e-(''+'^)*A,. (12) 

Remark 5.2 Note that this formula is slightly different than that in Lemma 2.2 
because of the normalization implied by requiring Q to have second eigenvalue 
-1. 



Proof: The sites created after r contribute in expectation. Of course, so do the 
deleted sites. The fraction of sites that survive is e~'**. Suppose site i survives, 
then note that 

E[(7* I (7* = Ws, i survives] = ^(e''**^)^^'^^' = e~^*Ws. 

s' 

Summing over all sites of r we get 

E[A„ I ar] = e-(''+'^)*A„ 

as claimed. ■ 



Consider now a "fork" tree, that is, a root r from which emanates a single edge 
Cu = (r, u) which in turn branches into two edges Ca = (m, a) and Cfe = (m, 6). For 
X — a,h, u, we denote the parameters of edge by t^, \x, JJ-x, Vx- Our goal is to 
compute E[r'(a, b)] assuming that the sequence length at the root is k. The proof 
is similar to Lemma 2.3. 

Lemma 5.3 The following holds: 



E 



V{a,b) 



Q-{Va+IJ.a)ta Q-{rib+iJ-b)ib Q-{lJ-u-K)tu 



Note that Remark 5.2 also applies here. 
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Proof: We have 



E[p(a,6)J = E[A„Ad 

= E [E [A„Afe I o-J] 

= E [E [A, I E [Afe I aj] 

= e~''"*''e~''"*"e~'"'*''e~''''*''E [A^] 

= e"''"*"e"''"*"e~'"'*''e"''''*''E [E [A^ | K^]] 

= e-^»*''e-'^"*°e-'"'*''e-'^''*''E [Var [A„ | 

= e-''''*"e"''"*"e-'"'*''e~''''*''E [X„Var[a^]]] 

= e~''"*''e~''"*°e"'"'*''e~''''*''E [ii'„E[((T^)^]] 

_ Q—Vata Q—IJ.ata Q-Vbtb Q—ntb Q—{lJ-u—>^u)tu 

by Lemma 5.1. ■ 

From the previous lenmias, one can adapt the proofs above to the GTR case. 

Nonclock case Using Lemma 5.3, we can get rid of the molecular clock as- 
sumption. Consider again the fork tree, but assume that each edge is in fact a 
path. An adaptation of Lemma 5.3 gives the following. 

Lemma 5.4 The following holds: 



E 



-In 



V{a,b) 



Note that Remark 5.2 also applies here. 
Proof: Note that 

-ln(A;-^E[K„]) = ^ (/^e - Ae)te, 

eeP(r,a) 

and similarly for h. A variant of Lemma 5.3 gives 

eeP(a,6) eeP(r,M) 

The result follows by subtracting the previous expressions. ■ 

The expression in Lemma 5.4 provides the additive metric needed to extend our 
results to non-clock bounded-rates case. 



In 



E 
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6 Concluding remarks 



We have shown how to reconstruct phylogenies under the Bounded-Rates, GTR 
model with indels. Our efficient algorithm requires polynomial-length sequences 
at the root. A natural open problem arises from this work: 

• Can our results be extended to general trees with bounded branch lengths, 
as opposed to the Bounded-Rates Model? The key difference between the 
two models is that the former may have a linear diameter whereas the latter 
has logarithmic diameter. To extend our results, one would need to deal 
with far away leaves that are almost uncorrelated but for which our block 
structure does not apply. 
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A Useful lemmas 



Recall the following standard concentration inequalities (see e.g. [MR95]): 

Lemma A.l (Chernoff bounds) Let Zi, . . . , Zm be independent {0, l}-random 
variables such that, for 1 < i < m, F[Zi = 1] = pi where < Pi < 1. Then, for 
Z = ET=i Zi, M = E[Z] = EtiPi' 0<6.<l,and0<6+< U, 

F[Z < (1 - 5_)M] < e-^'^-/^ 

and 

¥[Z > {l + S+)M]<e-<^^'^^-, 
where c{U) = [(1 + U) ln(l + U) - U]/U\ 
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